In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling
Authors
Abstract
Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing provides a platform to acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates of population parameters. This paper addresses this important challenge with the OASIS algorithm. OASIS draws samples from a (biased) instrumental distribution, chosen to have optimal asymptotic variance. As new labels are collected, OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, to quickly focus on regions providing more information. We prove that the resulting estimates of F-measure, precision, and recall converge to the true population values. Thorough comparisons of sampling methods on a variety of ER datasets demonstrate significant labelling reductions of up to 75% without loss of estimate accuracy.
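To make the idea of estimating F-measure from a biased instrumental distribution concrete, here is a minimal Python sketch. It is not the OASIS algorithm itself: the instrumental distribution is fixed (proportional to a hypothetical similarity score) rather than adapted as labels arrive, and it is not claimed to be variance-optimal. The toy pool, the scores, and the names make_toy_pool and importance_sampled_f are assumptions made purely for illustration.

```python
import random


def f_measure(tp, fp, fn, alpha=0.5):
    """Weighted F-measure; alpha=0.5 gives the usual F1 score."""
    denom = alpha * (tp + fp) + (1.0 - alpha) * (tp + fn)
    return tp / denom if denom > 0 else float("nan")


def make_toy_pool():
    """Hypothetical pool of 10,000 record pairs with only 100 true matches."""
    # Each group: (count, predicted label, true label, similarity score).
    groups = [
        (80, 1, 1, 0.9),     # true positives
        (20, 1, 0, 0.9),     # false positives
        (20, 0, 1, 0.4),     # false negatives
        (180, 0, 0, 0.4),    # borderline true negatives
        (9700, 0, 0, 0.01),  # easy true negatives
    ]
    preds, truth, scores = [], [], []
    for count, pred, label, score in groups:
        preds += [pred] * count
        truth += [label] * count
        scores += [score] * count
    return preds, truth, scores


def importance_sampled_f(preds, oracle, q, n_labels, rng, alpha=0.5):
    """Estimate F-measure from n_labels oracle queries drawn from q.

    preds  : 0/1 predictions for every pair in the pool
    oracle : callable mapping a pool index to its true 0/1 label
    q      : instrumental distribution over the pool (sums to 1)
    """
    n_pool = len(preds)
    sampled = rng.choices(range(n_pool), weights=q, k=n_labels)
    tp = fp = fn = 0.0
    for i in sampled:
        # Importance weight: uniform target probability over instrumental one.
        w = (1.0 / n_pool) / q[i]
        y = oracle(i)
        tp += w * preds[i] * y
        fp += w * preds[i] * (1 - y)
        fn += w * (1 - preds[i]) * y
    # Ratio of weighted sums, so any common scaling of the weights cancels.
    return f_measure(tp, fp, fn, alpha)


if __name__ == "__main__":
    preds, truth, scores = make_toy_pool()
    total = sum(scores)
    q = [s / total for s in scores]  # assumed score-proportional sampling
    estimate = importance_sampled_f(
        preds, truth.__getitem__, q, n_labels=1000, rng=random.Random(0)
    )
    print("true F1:", f_measure(80, 20, 20), "estimate:", estimate)
```

In this toy pool the true F1 is 0.8. Uniform sampling would spend almost all of the label budget on easy non-matches, whereas the score-weighted draw concentrates labels on predicted matches and borderline pairs, and the importance weights correct for that bias.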
Similar resources
Asymptotic properties of the sample mean in adaptive sequential sampling with multiple selection criteria
We extend the method of adaptive two-stage sequential sampling to include designs where more than one criterion is used in deciding on the allocation of additional sampling effort. These criteria, or conditions, can be a measure of the target population, or a measure of some related population. We develop Murthy estimator for the design that is unbiased estimators for t...
An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches
Text coherence evaluation has become a vital task in Natural Language Processing subfields such as text summarization, question answering, text generation and machine translation. Existing methods, such as entity-based and graph-based models, track how nouns and noun phrases change roles across sequential sentences within a short part of a text. They even have limitations in global coheren...
Optimal Capacitor Allocation in Radial Distribution Networks for Annual Costs Minimization Using Hybrid PSO and Sequential Power Loss Index Based Method
In the most recent heuristic methods, the high potential buses for capacitor placement are initially identified and ranked using loss sensitivity factors (LSFs) or power loss index (PLI). These factors or indices help to reduce the search space of the optimization procedure, but they may not always indicate the appropriate placement of capacitors. This paper proposes an efficient approach for t...
Optimal SIR algorithm vs. fully adapted auxiliary particle filter: a non asymptotic analysis
Particle filters (PF) and auxiliary particle filters (APF) are widely used sequential Monte Carlo (SMC) techniques. In this paper we comparatively analyse, from a non-asymptotic point of view, the Sampling Importance Resampling (SIR) PF with optimal conditional importance distribution (CID) and the fully adapted APF (FA). We compute the (finite-sample) conditional second order moments of Mon...
oASIS: Adaptive Column Sampling for Kernel Matrix Approximation
Computing with large kernel or similarity matrices is essential to many state-of-the-art machine learning techniques in classification, clustering, and dimensionality reduction. The cost of forming and factoring these kernel matrices can become intractable for large datasets. We introduce an adaptive column sampling technique called Accelerated Sequential Incoherence Selection (oASIS) that sa...
Journal title:
- PVLDB
Volume 10, Issue
Pages -
Publication date 2017